Floppyshop 2

home *** CD-ROM | disk | FTP | other *** search

/ Floppyshop 2 / Floppyshop - 2.zip / Floppyshop - 2.iso / diskmags / 5791-.end / dmg-6143 / lza_rept / compres3.txt < prev next >

Wrap

Text File | 1996-06-04 | 9KB | 182 lines

Experiments in Data Compression III (or "The Path Not Taken") by John Kormylo E.L.F. Software State of the Art In the following table are shown the results for compressing three different files using three different methods. 'ZIP' refers to STZIP 2.5 (deflating). 'LZA1' refers to the LZA method used by ELFBACK v2.9. 'LZA2' refers to the LZA method used by ELFARC v1.3. | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT | -------+--------------+-------------+--------------+ ZIP | 25134 | 32417 | 496876 | LZA1 | 24860 | 34719 | 585185 | LZA2 | 22029 | 31997 | 453143 | -------+--------------+-------------+--------------+ Back to Basics Huffman and Arithmetic coding are fixed length to variable length codes. They take advantage of the fact that some symbols occur more often than others, and give them shorter codes than the less common symbols. LZW is a variable length to fixed length code, which attempts to create a dictionary of very long sequences which all have the same probability of occurrence. Whenever a sequence is used, the algorithm attempts to make it longer. Previous experience indicates that a variable length to variable length code is best. There are certain sequences which occur relatively often, but not necessarily together. For text this should be obvious in that words are used many times but usually in different combinations. Therefore making the sequence longer is often not nearly as useful as reducing the number of bits needed to code it. LZA codes sequences using a length code (based on the relative occurance of matches of that length) followed by an alphabetical order code (based on how many matches of that length were found in the dictionary). Whether to use one long match or two short matches was determined by calculating the resulting number of bits for all possible combinations. One would be hard pressed to devise a more effective coding system. The problem with LZA is that it is slow. In part this is due to the binary tree search, which must perform many string comparisons to locate the first and last matches corresponding to a specified length. This is also due to the large dictionary of possible sequences, from which the best combination of sequences must be determined. There are 16K text pointers and 64 possible lengths, for a maximum of 1 Meg possible strings in the dictionary (only about half that number are unique). A smaller dictionary would require less time to search and might make the search for the best coding sequence unnecessary. Hash Tables Hash tables are faster than binary search trees. Also, each entry in the table corresponds to a particular length of string, and is the only entry corresponding to that string. Using the index of the dictionary entry corresponds to a variable to fixed length code (such as LZW). For a variable to variable length code, one would also need to keep a count of how many times that string was used. Because of the 16 bit limit for the arithmetic coder, this means that the total number of counts must be less than 64K. Needless to say, the total number of strings in such a dictionary would have to be much less than 1 Meg, probably less than 32K. | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT | -------+--------------+-------------+--------------+ LZW | 32341 | 45147 | 905744 | Count | 31399 | 43143 | 771499 | Prune | 30574 | 39608 | 647192 | Merge | 28090 | 36325 | 606602 | Smart | 25974 | 35426 | 567752 | -------+--------------+-------------+--------------+ The above table shows the results for several LZ78 variants. The first uses LZW for all sequences and arithmetic for single bytes. The second keeps track of the number of times each sequence is used (variable to variable length coding). Both use dictionaries with 8K entries. One of the problems with LZW is that once a string has been added to the dictionary, it can never be removed. Instead, the entire dictionary is thrown out periodically and the whole process begins again. However, it is possible to implement a hash table where entries can be removed. Also, one can set the count of an entry to 0 so that intermediate strings which are 'never' used will not affect the coding (although they will continue to take up space in the hash table). 'Prune' in the above table reduces the count on every entry whenever the table is full (or the total of the counts reaches 64K). By doing all of the entries at once, we avoid the problem of having to re-sort. LZ78 and LZW extend the strings in the dictionary by one character each time they are used. This has two problems associated with it. First, the strings grow very slowly. Second, in the process of adding a long string which is repeated often, one will also form lots of short strings from the leftovers at all the intermediate stages. A much more effective approach is to extend a string by every character in the following string when two strings occur concurently. (Note, this means you cannot add strings to the dictionary until they have been sent to the receiver.) The results for this method are shown as 'Merge' in the above table. It is not always best to choose the longest match available, since the end may not line up with another string in the dictionary. 'Smart' in the previous table shows what happens when you choose the match length in order to code the most characters using two consecutive matches. (Ties go to the match with the highest count.) The best news is that this approach is relatively fast. LZ79 (combining LZ77 & LZ78) It was observed that tweaking the algorithm produced very little change in the performance. Also, the results were similar to the 'backwards matching' algorithm discussed in the previous chapter. In other words, this is about as much as you can do with dictionary compression. The next logical thing to try is adding a 'FIFO buffer.' A 64 byte sliding window was added to the match search. Lengths 2-64 were coded as 257-319 in the arithmetic character set (256 is used for the hash table). Offsets were sent without weights. 'Long' shows the results using the longer of the two matches (with ties going to the hash table) and no length optimization (compares to 'Merge' rather than 'Smart'). This achieved a minor improvement in the first two files, and a major improvement in WORDLIST. | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT | -------+--------------+-------------+--------------+ Long | 28086 | 35397 | 522226 | Ties | 28829 | 35880 | 493510 | Best | 25785 | 34833 | 448995 | Tree | 25498 | 34410 | 465510 | Lags | 25397 | 34330 | 415494 | 16K | 24489 | 34218 | 415349 | -------+--------------+-------------+--------------+ 'Ties' show the results of the above method when ties go to the sliding window instead of the hash table. This demonstrates the need for a more intelligent method for chosing which match to use, since the results are highly data dependent. 'Best' shows the results of using a Viterbi like algorithm to determine the best match & length (the same one used in 'LZA2'). This method looks at every possible way to compress the next 64 bytes, at least until all paths have the same first step. Of course, this approach is very slow even with the hash table. 'Tree' uses a binary search tree together with the 'Best' algorithm. Instead of looking at the last 64 characters, it uses text pointers corresponding to the last 64 sequences coded. 'Lags' uses counts of how often a match was found based on how far it was from the location matched (sort of like autocorrelation data). As shown in the appendix of the previous report, these counts vary from file to file. Experiments with the buffer sizes found that the best results in general were for a 64 entry FIFO buffer and a 16K entry HASH table. These results are shown as '16K.' At this point we are able to beat ZIP on two of the three files. Unfortunately, nothing seems to help the third file much. Increasing the size of the FIFO buffer helps a little. The problem with this file is that it doesn't compress well at all, and the matches found tend to be very short. Exhaustion Unfortunately, I was unable to find any way to achieve a substantial inprovement over the above results. However, the methods developed during these experiments should be applicable to improving the speed using the old LZA algorithm.